Raed Al-falahy
Healthcare Data Visualization: A Case Study of Cancer/Malignant Neoplasms
import seaborn as sns
import warnings
import pandas as pd
import numpy as np
# Temporarily suppress warnings while loading and cleaning the data
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    # Read in the dataset
    oecd_cancer_df = pd.read_csv('OECD_CANCER.csv')
    # Replace "China (People's Republic of)" with 'China' (literal match, no regex)
    oecd_cancer_df['Country'] = oecd_cancer_df['Country'].str.replace("China (People's Republic of)", "China", regex=False)
    # Replace 'Malignant neoplasms' with 'MN'
    oecd_cancer_df['Variable'] = oecd_cancer_df['Variable'].str.replace('Malignant neoplasms', 'MN', regex=False)
# Display the modified dataframe
print(oecd_cancer_df.shape)
oecd_cancer_df.head(6)
(2976, 11)
| | VAR | Variable | UNIT | Measure | COU | Country | YEA | Year | Value | Flag Codes | Flags |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CANCTOCA | MN | PERCMTTX | Incidence per 100 000 population | AUS | Australia | 2002 | 2002 | 312.0 | NaN | NaN |
| 1 | CANCTOCA | MN | PERCMTTX | Incidence per 100 000 population | AUS | Australia | 2008 | 2008 | 314.1 | NaN | NaN |
| 2 | CANCTOCA | MN | PERCMTTX | Incidence per 100 000 population | AUS | Australia | 2012 | 2012 | 323.0 | NaN | NaN |
| 3 | CANCCOLC | MN of colon | PERCMTTX | Incidence per 100 000 population | AUS | Australia | 2002 | 2002 | 41.7 | NaN | NaN |
| 4 | CANCCOLC | MN of colon | PERCMTTX | Incidence per 100 000 population | AUS | Australia | 2008 | 2008 | 38.7 | NaN | NaN |
| 5 | CANCCOLC | MN of colon | PERCMTTX | Incidence per 100 000 population | AUS | Australia | 2012 | 2012 | 38.4 | NaN | NaN |
oecd_cancer_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2976 entries, 0 to 2975
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   VAR         2976 non-null   object
 1   Variable    2976 non-null   object
 2   UNIT        2976 non-null   object
 3   Measure     2976 non-null   object
 4   COU         2976 non-null   object
 5   Country     2976 non-null   object
 6   YEA         2976 non-null   int64
 7   Year        2976 non-null   int64
 8   Value       2976 non-null   float64
 9   Flag Codes  0 non-null      float64
 10  Flags       0 non-null      float64
dtypes: float64(3), int64(2), object(6)
memory usage: 255.9+ KB
oecd_cancer_df.drop(['Flag Codes', 'Flags', 'COU', 'YEA', 'UNIT', 'VAR'], axis = 1, inplace = True)
oecd_cancer_df.head(10)
| | Variable | Measure | Country | Year | Value |
|---|---|---|---|---|---|
| 0 | MN | Incidence per 100 000 population | Australia | 2002 | 312.0 |
| 1 | MN | Incidence per 100 000 population | Australia | 2008 | 314.1 |
| 2 | MN | Incidence per 100 000 population | Australia | 2012 | 323.0 |
| 3 | MN of colon | Incidence per 100 000 population | Australia | 2002 | 41.7 |
| 4 | MN of colon | Incidence per 100 000 population | Australia | 2008 | 38.7 |
| 5 | MN of colon | Incidence per 100 000 population | Australia | 2012 | 38.4 |
| 6 | MN of lung | Incidence per 100 000 population | Australia | 2002 | 28.2 |
| 7 | MN of lung | Incidence per 100 000 population | Australia | 2008 | 25.6 |
| 8 | MN of lung | Incidence per 100 000 population | Australia | 2012 | 27.0 |
| 9 | MN | Incidence per 100 000 population | Austria | 2000 | 250.7 |
# Group the DataFrame by the "Country" column
grouped_df = oecd_cancer_df.groupby("Country")
# Get the number of rows in each group
grouped_counts = grouped_df.size()
# Get the number of unique values in the "Country" column
num_countries = len(oecd_cancer_df['Country'].unique())
# Print the number of countries to the console
print("Number of countries: ", num_countries)
# Display the number of rows in each group
#print(grouped_counts)
Number of countries: 44
import plotly.express as px
# Define the list of colors for each year
colors = ['#FFA07A', '#6B8E23', '#4682B4']
# Create the plot and assign a different color to each year
variable = px.box(oecd_cancer_df, x='Variable', y='Value', color='Year', color_discrete_sequence=colors)
# Customize the plot
variable.update_traces(marker=dict(size=3), boxmean=True, jitter=0.3, line=dict(width=2))
variable.update_layout(title={'text': 'Cancer Incidence by Variable and Year', 'y':0.98, 'x':0.5, 'xanchor': 'center', 'yanchor': 'top'},
                       xaxis_title='Variable', yaxis_title='Value')
variable.update_layout(legend=dict(orientation='h', yanchor='top', y=1.1, xanchor='center', x=0.5))
variable.show()
import plotly.express as px
# Filter the data to exclude 'ALL' measures and variables
oecd_cancer_df2 = oecd_cancer_df[oecd_cancer_df['Measure'] != 'ALL']
oecd_cancer_df2 = oecd_cancer_df2[oecd_cancer_df2['Variable'] != 'ALL']
# Create the treemap
fig = px.treemap(oecd_cancer_df2, path=['Variable', 'Country'], values='Value',
color='Variable', hover_data=['Variable'],
color_continuous_scale='RdBu',
)
# Customize the plot
fig.update_layout(title={'text': 'Cancer Incidence by Variable and Country', 'y':0.9, 'x':0.5, 'xanchor': 'center', 'yanchor': 'top'},
coloraxis_showscale=False)
fig.show()
import pandas as pd
import plotly.express as px
# Filter the data to exclude 'ALL' measures and variables
oecd_cancer_df2 = oecd_cancer_df[oecd_cancer_df['Measure'] != 'ALL']
oecd_cancer_df2 = oecd_cancer_df2[oecd_cancer_df2['Variable'] != 'ALL']
# Calculate total incidence of cancer for each country and sort in descending order
country_totals = oecd_cancer_df2.groupby('Country')['Value'].sum().sort_values(ascending=False)
# Select the top 15 countries by total incidence
top_countries = country_totals.head(15).index.tolist()
# Filter the data to include only the top 15 countries
oecd_cancer_top15 = oecd_cancer_df2[oecd_cancer_df2['Country'].isin(top_countries)]
# Create the animated bar chart
fig = px.bar(oecd_cancer_top15, x='Country', y='Value', color='Country', animation_frame='Variable',
             labels={'Value': 'Incidence'},
             title='Cancer Incidence by Country and Variable',
             range_y=[0, oecd_cancer_top15['Value'].max()])
# Customize the plot
fig.update_layout(xaxis_title=' ', yaxis_title='Incidence')
# Add Play and Pause buttons for controlling the animation
fig.update_layout(
updatemenus=[
dict(
type='buttons',
showactive=False,
buttons=[
dict(
label='Play',
method='animate',
args=[None, {'frame': {'duration': 500, 'redraw': True},
'fromcurrent': True,
'transition': {'duration': 0}}]
),
dict(
label='Pause',
method='animate',
args=[[None], {'frame': {'duration': 0, 'redraw': False},
'mode': 'immediate',
'transition': {'duration': 0}}]
)
],
x=0.5,
y=1.5
)
]
)
# Style the slider that px.bar already created for animation_frame.
# layout.sliders is a list, so the update is passed as a one-element list;
# the animation steps generated by Plotly Express are left untouched.
fig.update_layout(
    sliders=[dict(
        yanchor='top',
        xanchor='left',
        currentvalue=dict(
            font=dict(size=16),
            prefix='Variable: ',
            xanchor='right',
            visible=True,
        ),
        transition=dict(duration=500),
        len=0.9,
        x=0.1,
        y=0,
    )]
)
# Adjust the x-axis range
fig.update_xaxes(range=[-0.5, len(top_countries)-0.5])
fig.show()
import plotly.express as px
# Filter the data to exclude 'ALL' measures and variables
oecd_cancer_df2 = oecd_cancer_df[oecd_cancer_df['Measure'] != 'ALL']
oecd_cancer_df2 = oecd_cancer_df2[oecd_cancer_df2['Variable'] != 'ALL']
# Remove outliers
oecd_cancer_df2 = oecd_cancer_df2[oecd_cancer_df2['Value'] <= oecd_cancer_df2['Value'].quantile(0.99)]
# Create the strip plot
fig = px.strip(oecd_cancer_df2, x='Variable', y='Value', color='Country', facet_col='Year',
labels={'Value': 'Incidence', 'Variable': 'Variable'},
title='Cancer Incidence by Country and Year')
# Customize the plot
fig.update_layout(xaxis_title='Variable', yaxis_title='Incidence', height=500)
fig.update_traces(marker=dict(size=4))
fig.update_layout(legend_title='', showlegend=False)
fig.update_layout(margin=dict(t=120, r=0, b=0, l=0))
# Add a dropdown to select the variable to show
fig.update_layout(
updatemenus=[
dict(
buttons=[
dict(
args=[{'x': [oecd_cancer_df2[oecd_cancer_df2['Variable'] == variable]['Variable']],
'y': [oecd_cancer_df2[oecd_cancer_df2['Variable'] == variable]['Value']]}],
label=variable,
method='update'
) for variable in oecd_cancer_df2['Variable'].unique()
],
direction='down',
pad={'r': 10, 't': 10},
showactive=True,
x=0.1,
xanchor='left',
y=1.1,
yanchor='top'
)
]
)
fig.show()
import plotly.express as px
# Create the density heatmap figure
fig = px.density_heatmap(oecd_cancer_df2, x='Variable', y='Country')
# Update the figure layout
fig.update_layout(
title='Cancer Rates by Variable and Country in OECD Nations',
xaxis_title='Variable',
yaxis_title='Country',
height=500,
width=800,
)
# Show the figure
fig.show()
import plotly.express as px
# Filter the data to only include the top 10 countries by incidence rate
top_10_countries = oecd_cancer_df2.groupby('Country')['Value'].sum().sort_values(ascending=False).head(10).index
filtered_df = oecd_cancer_df2[oecd_cancer_df2['Country'].isin(top_10_countries)]
# Create the pie chart figure
fig = px.pie(filtered_df, values='Value', names='Country')
# Update the figure layout
fig.update_layout(
title='Top 10 Countries by Cancer Incidence Rate',
height=600,
width=800,
)
# Show the figure
fig.show()
from sklearn.ensemble import RandomForestRegressor
# Filter the data to only include the top 10 countries by incidence rate
top_10_countries = oecd_cancer_df2.groupby('Country')['Value'].sum().sort_values(ascending=False).head(10).index
filtered_df = oecd_cancer_df2[oecd_cancer_df2['Country'].isin(top_10_countries)].copy()  # copy so a column can be added safely
# Create a dictionary to map variable names to numerical indices
variable_mapping = {variable: index for index, variable in enumerate(filtered_df['Variable'].unique())}
# Create the 'VariableIndex' column by mapping the 'Variable' column using the dictionary
filtered_df['VariableIndex'] = filtered_df['Variable'].map(variable_mapping)
# Select only the numerical features as input data
X = filtered_df[['VariableIndex', 'Year']]
y = filtered_df['Value']
# Train a random forest regression model
model = RandomForestRegressor()
model.fit(X, y)
# Use the model to predict cancer incidence rates for each country in the filtered data
# (Country is not a model feature, so a given variable and year gets the same prediction for every country)
predicted_values = []
for country in top_10_countries:
    for variable_index in filtered_df['VariableIndex'].unique():
        for year in filtered_df['Year'].unique():
            predicted_value = model.predict([[variable_index, year]])[0]
            predicted_values.append((country, variable_index, year, predicted_value))
# Convert the predicted values to a Pandas DataFrame
predicted_df = pd.DataFrame(predicted_values, columns=['Country', 'VariableIndex', 'Year', 'Predicted Value'])
# Merge the filtered data with the predicted data
merged_df = pd.merge(filtered_df, predicted_df, on=['Country', 'VariableIndex', 'Year'], how='outer')
# Create the line chart figure
fig = px.line(merged_df, x='Year', y='Value', color='Country', line_group='Variable', hover_name='Variable')
# Update the figure layout
fig.update_layout(
title='Cancer Incidence Rates by Variable and Year for Top 10 Countries',
xaxis_title='Year',
yaxis_title='Cancer Incidence Rate',
height=400,
width=800,
)
# Show the figure
fig.show()
import plotly.express as px
# Filter the data to only include the top 15 countries by incidence rate
top_15_countries = oecd_cancer_df2.groupby('Country')['Value'].sum().sort_values(ascending=False).head(15).index
filtered_df = oecd_cancer_df2[oecd_cancer_df2['Country'].isin(top_15_countries)]
# Create the pie chart figure
fig = px.pie(filtered_df, values='Value', names='Country')
# Update the figure layout
fig.update_layout(
title='Top 15 Countries by Cancer Incidence Rate',
height=600,
width=800,
)
# Show the figure
fig.show()
import plotly.express as px
# Filter the data to only include the top 15 countries by incidence rate
top_15_countries = oecd_cancer_df2.groupby('Country')['Value'].sum().sort_values(ascending=False).head(15).index
filtered_df = oecd_cancer_df2[oecd_cancer_df2['Country'].isin(top_15_countries)]
# Create the bar chart figure
fig = px.bar(filtered_df, x='Country', y='Value', color='Variable', barmode='group')
# Update the figure layout
fig.update_layout(
title='Top 15 Countries by Cancer Incidence Rate',
height=600,
width=800,
xaxis_title='Country',
yaxis_title='Cancer Incidence Rate',
legend_title='Type of Cancer',
)
# Show the figure
fig.show()
import plotly.express as px
import pandas as pd
# Filter the data to only include the top 15 countries by incidence rate
top_15_countries = oecd_cancer_df2.groupby('Country')['Value'].sum().sort_values(ascending=False).head(15).index
filtered_df = oecd_cancer_df2[oecd_cancer_df2['Country'].isin(top_15_countries)]
# Create the bar chart figure
fig = px.bar(filtered_df, x='Country', y='Value', color='Variable', barmode='group')
# Update the figure layout
fig.update_layout(
title='Top 15 Countries by Cancer Incidence Rate',
height=600,
width=800,
xaxis_title='Country',
yaxis_title='Cancer Incidence Rate',
legend_title='Type of Cancer',
)
# Create the dropdown menu: one button per cancer type.
# 'visible' needs one boolean per trace; px.bar names each trace after its Variable value.
dropdown_menu = []
for variable in filtered_df['Variable'].unique():
    dropdown_menu.append(
        {'label': variable, 'method': 'update', 'args': [
            {'visible': [trace.name == variable for trace in fig.data]},
            {'title': f'Top 15 Countries by {variable} Incidence Rate'}
        ]}
    )
# Add the dropdown menu to the layout
fig.update_layout(updatemenus=[{'type': 'dropdown', 'active': 0, 'buttons': dropdown_menu}], hovermode='closest')
# Show the figure
fig.show()
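A dropdown button that toggles traces passes `visible` as a list with one boolean per trace, not one per data row. The sketch below builds such masks in plain Python so the shape is easy to inspect. The trace names and the helper `visibility_mask` are hypothetical stand-ins for illustration, not values taken from the figure above.

```python
# Hypothetical trace names, standing in for the traces px.bar creates with color='Variable'
trace_names = ["MN", "MN of colon", "MN of lung"]


def visibility_mask(names, selected):
    """Return one boolean per trace: True only for the trace named `selected`."""
    return [name == selected for name in names]


# One button per trace; each button shows a single trace and retitles the figure
buttons = [
    {
        "label": name,
        "method": "update",
        "args": [
            {"visible": visibility_mask(trace_names, name)},
            {"title": f"Top 15 Countries by {name} Incidence Rate"},
        ],
    }
    for name in trace_names
]

for button in buttons:
    print(button["label"], button["args"][0]["visible"])
```

Each mask has the same length as the list of traces, which is what lets a button hide every trace except the selected one.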
import plotly.express as px
fig = px.violin(oecd_cancer_df, x='Value', y='Variable', color='Year', box=True, points='all')
fig.update_layout(
title='Distribution of Cancer Incidence Rates by Type and Year',
xaxis_title='Cancer Incidence Rate',
yaxis_title='Type of Cancer',
legend_title='Year',
width=700,
height=500,
)
fig.show()
import matplotlib.pyplot as plt
# Define colors for each bar
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']
# Group the data by Variable and calculate the mean
grouped = oecd_cancer_df.groupby('Variable')['Value'].mean()
# Create a bar chart
plt.bar(grouped.index, grouped.values, color=colors)
# Add labels and title
plt.xlabel('Variable')
plt.ylabel('Value')
plt.title('Average Value by Variable')
# Add values for each bar
for i, value in enumerate(grouped.values):
    plt.text(i, value, round(value, 2), ha='center', va='bottom')
# Add a legend to the chart
#plt.legend(['Average Value'])
# Move the grid behind the bars
plt.gca().set_axisbelow(True)
plt.grid(True, linestyle='--')
# Rotate the x-tick labels by 45 degrees
plt.xticks(rotation=45)
# Adjust the size of the plot
plt.gcf().set_size_inches(9,7)
# Show the plot
plt.show()
Malignant_neoplasms =oecd_cancer_df[oecd_cancer_df['Variable'] == 'MN']
Malignant_neoplasms.head()
| | Variable | Measure | Country | Year | Value |
|---|---|---|---|---|---|
| 0 | MN | Incidence per 100 000 population | Australia | 2002 | 312.0 |
| 1 | MN | Incidence per 100 000 population | Australia | 2008 | 314.1 |
| 2 | MN | Incidence per 100 000 population | Australia | 2012 | 323.0 |
| 9 | MN | Incidence per 100 000 population | Austria | 2000 | 250.7 |
| 10 | MN | Incidence per 100 000 population | Austria | 2002 | 275.5 |
# Count the distinct countries that report the overall 'MN' variable
Malignant_neoplasms_Country = Malignant_neoplasms['Country'].unique()
len(Malignant_neoplasms_Country)
44
unique_Variable = Malignant_neoplasms['Measure'].unique()
unique_Variable
array(['Incidence per 100 000 population',
'Incidence per 100 000 females', 'Incidence per 100 000 males',
'Number of female cases', 'Number of male cases',
'Number of total cases'], dtype=object)
# Sum only the numeric 'Value' column so the text columns are not aggregated
Malignant_neoplasms_groupby_Measure_year = Malignant_neoplasms.groupby(by=['Measure','Year'])[['Value']].sum()
Malignant_neoplasms_groupby_Measure_year
| Measure | Year | Value |
|---|---|---|
| Incidence per 100 000 females | 2000 | 3583.6 |
| | 2002 | 6929.8 |
| | 2008 | 7912.8 |
| | 2012 | 9909.2 |
| Incidence per 100 000 males | 2000 | 4549.4 |
| | 2002 | 9041.8 |
| | 2008 | 10234.0 |
| | 2012 | 12671.7 |
| Incidence per 100 000 population | 2000 | 4066.5 |
| | 2002 | 7986.5 |
| | 2008 | 8869.7 |
| | 2012 | 11033.6 |
| Number of female cases | 2000 | 758757.0 |
| | 2002 | 2074716.0 |
| | 2008 | 2380411.0 |
| | 2012 | 5156553.0 |
| Number of male cases | 2000 | 883500.0 |
| | 2002 | 2460751.0 |
| | 2008 | 2889184.0 |
| | 2012 | 6063242.0 |
| Number of total cases | 2000 | 1642257.0 |
| | 2002 | 4535467.0 |
| | 2008 | 5192631.0 |
| | 2012 | 11219795.0 |
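A caution when reading the grouped sums: a sum of per-100 000 rates grows with the number of rows in each year, and the per-year row counts printed later in this notebook (384 rows for 2000 versus 1,056 for 2012) suggest that coverage differs by year. The sketch below uses made-up rates and country labels, not values from the dataset, to show how a sum can double while the average rate stays flat.

```python
from collections import defaultdict

# Hypothetical incidence rates per 100 000 (illustrative only, not from OECD_CANCER.csv).
# Two countries report in 2000, four in 2012.
records = [
    (2000, "A", 300.0),
    (2000, "B", 250.0),
    (2012, "A", 305.0),
    (2012, "B", 255.0),
    (2012, "C", 280.0),
    (2012, "D", 260.0),
]

# Collect the rates reported in each year
by_year = defaultdict(list)
for year, _country, rate in records:
    by_year[year].append(rate)

# The sum depends on how many countries report; the mean does not
sums = {year: sum(rates) for year, rates in by_year.items()}
means = {year: sum(rates) / len(rates) for year, rates in by_year.items()}

print("sums:", sums)
print("means:", means)
```

Under these assumptions, a per-year mean (or a comparison restricted to countries present in every year) is the safer basis for reading a trend from rate measures.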
oecd_cancer_df = pd.read_csv('OECD_CANCER.csv')
oecd_cancer_df.drop(['Flag Codes', 'Flags', 'COU', 'YEA', 'UNIT', 'VAR'], axis = 1, inplace = True)
Malignant_neoplasms_colon = oecd_cancer_df[oecd_cancer_df['Variable'] == 'Malignant neoplasms of colon']
Malignant_neoplasms_colon
| | Variable | Measure | Country | Year | Value |
|---|---|---|---|---|---|
| 3 | Malignant neoplasms of colon | Incidence per 100 000 population | Australia | 2002 | 41.7 |
| 4 | Malignant neoplasms of colon | Incidence per 100 000 population | Australia | 2008 | 38.7 |
| 5 | Malignant neoplasms of colon | Incidence per 100 000 population | Australia | 2012 | 38.4 |
| 13 | Malignant neoplasms of colon | Incidence per 100 000 population | Austria | 2000 | 34.7 |
| 14 | Malignant neoplasms of colon | Incidence per 100 000 population | Austria | 2002 | 35.0 |
| ... | ... | ... | ... | ... | ... |
| 2959 | Malignant neoplasms of colon | Incidence per 100 000 population | Colombia | 2012 | 12.9 |
| 2963 | Malignant neoplasms of colon | Number of male cases | Costa Rica | 2012 | 404.0 |
| 2966 | Malignant neoplasms of colon | Incidence per 100 000 males | Latvia | 2012 | 30.0 |
| 2971 | Malignant neoplasms of colon | Number of total cases | Colombia | 2012 | 5663.0 |
| 2973 | Malignant neoplasms of colon | Incidence per 100 000 females | Lithuania | 2012 | 18.9 |
744 rows × 5 columns
unique_Measure_colon = Malignant_neoplasms_colon['Measure'].unique()
unique_Measure_colon
array(['Incidence per 100 000 population',
'Incidence per 100 000 females', 'Incidence per 100 000 males',
'Number of female cases', 'Number of male cases',
'Number of total cases'], dtype=object)
# Malignant_neoplasms_colon.drop(['Year'], axis = 1, inplace = True)
#Measure_colon_new = unique_Measure_colon
#unique_Measure_colon_groupby_Measure = Measure_colon_new.groupby(by=['Measure']).sum()
#unique_Measure_colon_groupby_Measure
# Malignant_neoplasms_groupby_Measure.values.flatten()
import matplotlib.pyplot as plt
import pandas as pd
# Group the data by Measure and calculate the sum
unique_Measure_colon_groupby_Measure = Malignant_neoplasms_colon.groupby(by=['Measure'])[['Value']].sum()
# Create a bar chart
plt.bar(unique_Measure_colon_groupby_Measure.index, unique_Measure_colon_groupby_Measure['Value'], color=['blue', 'green', 'red'])
# Add axis labels and a title
plt.xlabel('Measure', fontweight='bold')
plt.ylabel('Value', fontweight='bold')
plt.title('Distribution of Measures for Colon Cancer')
# Rotate the x-axis labels
plt.xticks(rotation=45)
# Show the plot
plt.show()
unique_Measure_colon_groupby_Measure_year = Malignant_neoplasms_colon.groupby(by=['Measure','Year'])[['Value']].sum()
unique_Measure_colon_groupby_Measure_year
| Measure | Year | Value |
|---|---|---|
| Incidence per 100 000 females | 2000 | 411.2 |
| | 2002 | 800.3 |
| | 2008 | 856.3 |
| | 2012 | 994.7 |
| Incidence per 100 000 males | 2000 | 577.9 |
| | 2002 | 1163.4 |
| | 2008 | 1313.4 |
| | 2012 | 1519.8 |
| Incidence per 100 000 population | 2000 | 494.6 |
| | 2002 | 982.7 |
| | 2008 | 1057.6 |
| | 2012 | 1227.7 |
| Number of female cases | 2000 | 108299.0 |
| | 2002 | 274929.0 |
| | 2008 | 301160.0 |
| | 2012 | 509354.0 |
| Number of male cases | 2000 | 117921.0 |
| | 2002 | 319253.0 |
| | 2008 | 362959.0 |
| | 2012 | 628072.0 |
| Number of total cases | 2000 | 226220.0 |
| | 2002 | 594182.0 |
| | 2008 | 664119.0 |
| | 2012 | 1137426.0 |
oecd_cancer_df = pd.read_csv('OECD_CANCER.csv')
oecd_cancer_df.drop(['Flag Codes', 'Flags', 'COU', 'YEA', 'UNIT', 'VAR'], axis = 1, inplace = True)
Malignant_neoplasms_lung = oecd_cancer_df[oecd_cancer_df['Variable'] == 'Malignant neoplasms of lung']
Malignant_neoplasms_lung
| | Variable | Measure | Country | Year | Value |
|---|---|---|---|---|---|
| 6 | Malignant neoplasms of lung | Incidence per 100 000 population | Australia | 2002 | 28.2 |
| 7 | Malignant neoplasms of lung | Incidence per 100 000 population | Australia | 2008 | 25.6 |
| 8 | Malignant neoplasms of lung | Incidence per 100 000 population | Australia | 2012 | 27.0 |
| 17 | Malignant neoplasms of lung | Incidence per 100 000 population | Austria | 2000 | 27.0 |
| 18 | Malignant neoplasms of lung | Incidence per 100 000 population | Austria | 2002 | 28.5 |
| ... | ... | ... | ... | ... | ... |
| 2957 | Malignant neoplasms of lung | Incidence per 100 000 females | Latvia | 2012 | 7.9 |
| 2965 | Malignant neoplasms of lung | Number of female cases | Costa Rica | 2012 | 130.0 |
| 2968 | Malignant neoplasms of lung | Number of male cases | Colombia | 2012 | 3038.0 |
| 2972 | Malignant neoplasms of lung | Incidence per 100 000 population | Lithuania | 2012 | 26.2 |
| 2975 | Malignant neoplasms of lung | Incidence per 100 000 males | Costa Rica | 2012 | 9.8 |
744 rows × 5 columns
# Drop 'Year' by reassignment rather than in place, so no slice of oecd_cancer_df is modified
Malignant_neoplasms_lung = Malignant_neoplasms_lung.drop(columns=['Year'])
unique_Measure_lung_groupby_Measure = Malignant_neoplasms_lung.groupby(by=['Measure'])[['Value']].sum()
unique_Measure_lung_groupby_Measure
# Malignant_neoplasms_groupby_Measure.values.flatten()
| Measure | Value |
|---|---|
| Incidence per 100 000 females | 2022.8 |
| Incidence per 100 000 males | 5475.2 |
| Incidence per 100 000 population | 3641.1 |
| Number of female cases | 970824.0 |
| Number of male cases | 2060176.0 |
| Number of total cases | 3031000.0 |
import matplotlib.pyplot as plt
import pandas as pd
# Group the data by Measure and calculate the sum
unique_Measure_lung_groupby_Measure = Malignant_neoplasms_lung.groupby(by=['Measure'])[['Value']].sum()
# Create a bar chart
plt.bar(unique_Measure_lung_groupby_Measure.index, unique_Measure_lung_groupby_Measure['Value'], color=['blue', 'green', 'black'])
# Add axis labels and a title
plt.xlabel('Measure', fontweight='bold')
plt.ylabel('Value', fontweight='bold')
plt.title('Distribution of Measures for Lung Cancer')
# Rotate the x-axis labels
plt.xticks(rotation=45)
# Show the plot
plt.show()
oecd_cancer_df = pd.read_csv('OECD_CANCER.csv')
oecd_cancer_df.drop(['Flag Codes', 'Flags', 'COU', 'YEA', 'UNIT', 'VAR'], axis = 1, inplace = True)
Malignant_neoplasms_prostate = oecd_cancer_df[oecd_cancer_df['Variable'] == 'Malignant neoplasms of prostate']
Malignant_neoplasms_prostate
| | Variable | Measure | Country | Year | Value |
|---|---|---|---|---|---|
| 969 | Malignant neoplasms of prostate | Incidence per 100 000 males | Australia | 2002 | 76.0 |
| 970 | Malignant neoplasms of prostate | Incidence per 100 000 males | Australia | 2008 | 105.0 |
| 971 | Malignant neoplasms of prostate | Incidence per 100 000 males | Australia | 2012 | 115.2 |
| 984 | Malignant neoplasms of prostate | Incidence per 100 000 males | Austria | 2000 | 49.8 |
| 985 | Malignant neoplasms of prostate | Incidence per 100 000 males | Austria | 2002 | 71.4 |
| ... | ... | ... | ... | ... | ... |
| 2926 | Malignant neoplasms of prostate | Incidence per 100 000 males | Lithuania | 2012 | 60.9 |
| 2944 | Malignant neoplasms of prostate | Incidence per 100 000 males | Latvia | 2012 | 82.7 |
| 2951 | Malignant neoplasms of prostate | Number of male cases | Costa Rica | 2012 | 1556.0 |
| 2964 | Malignant neoplasms of prostate | Number of male cases | Colombia | 2012 | 9564.0 |
| 2970 | Malignant neoplasms of prostate | Incidence per 100 000 males | Costa Rica | 2012 | 67.5 |
248 rows × 5 columns
Malignant_neoplasms_fig = Malignant_neoplasms[Malignant_neoplasms['Measure'] != 'Incidence per 100 000 males']
df_fig1 = Malignant_neoplasms_fig[Malignant_neoplasms_fig['Measure'] != 'Incidence per 100 000 females']
df_fig2 = df_fig1 [df_fig1 ['Measure'] != 'Incidence per 100 000 population']
import pandas as pd
import plotly.express as px
# Load your dataset
data = pd.read_csv("OECD_CANCER.csv")
# Filter the data as needed
df_fig1 = data[data['Measure'] != 'Number of total cases']
df_fig1 = df_fig1[df_fig1['Variable'] != 'Malignant neoplasms']
# Get the unique countries and create a custom color mapping
# (the Plotly qualitative palette has only 10 colors, so cycle through it to cover every country)
from itertools import cycle
unique_countries = df_fig1['Country'].unique()
color_map = dict(zip(unique_countries, cycle(px.colors.qualitative.Plotly)))
# Modify the path parameter to remove the cancer type
sun_fig = px.sunburst(df_fig1, path=['Measure', 'Country'], values='Value',
color='Country', hover_data=['Variable'],
title="Total Number of female and male cases of Malignant Neoplasms",
color_discrete_map=color_map,
width=800, height=800)
sun_fig.show()
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
# Load the OECD_CANCER dataset
oecd_cancer_df = pd.read_csv('OECD_CANCER.csv')
# Select the variables to use for clustering
cluster_vars = ['Year', 'Value']
# Convert the selected variables to a numpy array
X = np.array(oecd_cancer_df[cluster_vars])
# Normalize the data with the global mean and standard deviation
# (X.mean() and X.std() pool both columns; per-column z-scores would use axis=0)
X_norm = (X - X.mean()) / X.std()
# Perform K-means clustering with k=3 clusters
kmeans = KMeans(n_clusters=3)
kmeans.fit(X_norm)
# Add the cluster labels to the original dataframe
oecd_cancer_df['Cluster'] = kmeans.labels_
# View the distribution of clusters by year
print(oecd_cancer_df.groupby(['Year', 'Cluster']).size())
Year Cluster
2000 0 378
2 6
2002 0 705
1 1
2 14
2008 0 798
1 1
2 17
2012 0 1018
1 4
2 34
dtype: int64
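The normalization step above subtracts a single mean and divides by a single standard deviation computed over both columns together, so the large `Value` entries dominate the distances K-means sees. A per-column z-score gives each feature mean 0 and standard deviation 1 instead. The sketch below shows that arithmetic in plain Python on made-up rows; the helper `zscore_columns` is hypothetical, and the NumPy equivalent would be `(X - X.mean(axis=0)) / X.std(axis=0)`.

```python
import statistics

# Hypothetical (Year, Value) rows, illustrative only
rows = [(2000, 10.0), (2002, 200.0), (2008, 3000.0), (2012, 40000.0)]


def zscore_columns(data):
    """Standardize each column with its own mean and population standard deviation."""
    columns = list(zip(*data))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - mean) / std for value, mean, std in zip(row, means, stds))
        for row in data
    ]


scaled = zscore_columns(rows)
year_col = [row[0] for row in scaled]
value_col = [row[1] for row in scaled]

# After scaling, both columns are on the same footing
print(statistics.fmean(year_col), statistics.pstdev(year_col))
print(statistics.fmean(value_col), statistics.pstdev(value_col))
```

Switching the notebook to per-column scaling would change the cluster assignments, so the counts printed above apply only to the pooled normalization actually used.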
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Load the OECD_CANCER dataset
oecd_cancer_df = pd.read_csv('OECD_CANCER.csv')
# Select the variables to use for clustering
cluster_vars = ['Year', 'Value']
# Convert the selected variables to a numpy array
X = np.array(oecd_cancer_df[cluster_vars])
# Normalize the data with the global mean and standard deviation
# (X.mean() and X.std() pool both columns; per-column z-scores would use axis=0)
X_norm = (X - X.mean()) / X.std()
# Perform K-means clustering with k=3 clusters
kmeans = KMeans(n_clusters=3)
kmeans.fit(X_norm)
# Add the cluster labels to the original dataframe
oecd_cancer_df['Cluster'] = kmeans.labels_
# Group the data by year and cluster and compute the count of data points in each group
cluster_counts = oecd_cancer_df.groupby(['Year', 'Cluster']).size().reset_index(name='Count')
# Reshape the data into a pivot table format
cluster_pivot = cluster_counts.pivot(index='Year', columns='Cluster', values='Count')
# Create a stacked bar plot of the data
ax = cluster_pivot.plot(kind='bar', stacked=True, colormap='viridis')
plt.xlabel('Year')
plt.ylabel('Count')
plt.title('Distribution of Clusters by Year')
plt.show()
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Work on a copy and shorten the variable names; oecd_cancer_df was re-read from the CSV above,
# so it holds the full 'Malignant neoplasms ...' names again
df = oecd_cancer_df.copy()
df['Variable'] = df['Variable'].str.replace('Malignant neoplasms', 'MN', regex=False)
# Filter the data for specific cancer types
selected_cancer_types = ['MN of colon', 'MN of lung', 'MN of cervix', 'MN of prostate']
df = df[df['Variable'].isin(selected_cancer_types)]
# Pivot the data to have a multi-level column index with Variable and Measure
# (Country and Year stay in the index so they are excluded from the correlations)
df_pivot = df.pivot_table(index=['Country', 'Year'], columns=['Variable', 'Measure'], values='Value')
# Calculate correlations between different types of cancer
correlations = df_pivot.corr()
# Customize the labels for a clearer plot
def clean_label(label):
    label = label.replace("Incidence per 100 000 males", "M")
    label = label.replace("Incidence per 100 000 females", "F")
    label = label.replace("Incidence per 100 000 population", "P")
    label = label.replace("Number of female cases", "F_C")
    label = label.replace("Number of total cases", "T_C")
    label = label.replace("Number of male cases", "M_C")
    return label
labels = [clean_label(col[0] + ' - ' + col[1]) for col in correlations.columns]
correlations.columns = labels
correlations.index = labels
# Plot the heatmap of correlations
plt.figure(figsize=(14, 12))
sns.set(font_scale=1.2)
sns.heatmap(correlations, annot=True, fmt='.2f', cmap="coolwarm", square=True, linewidths=0.5, cbar_kws={'shrink': .8})
plt.title("Correlation Between Different Types of Cancer", fontsize=18)
plt.show()
The output of the code is a heatmap of pairwise correlations. Each row and column is one combination of cancer type and measure (for example, "MN of colon - M" is the colon cancer incidence per 100 000 males), and each cell holds the Pearson correlation coefficient (r) between two of those columns across countries and years. The coefficient ranges between -1 and 1, and the cell color encodes its strength and direction.
Positive correlation (r > 0): As one variable increases, the other variable also increases. The strength of the positive correlation increases as the value approaches 1. In the heatmap, these correlations are represented by warmer colors (reds).
Negative correlation (r < 0): As one variable increases, the other variable decreases. The strength of the negative correlation increases as the value approaches -1. In the heatmap, these correlations are represented by cooler colors (blues).
No correlation (r = 0): There is no linear relationship between the two variables. In the "coolwarm" colormap used here, this is represented by a neutral light gray.
The diagonal cells, each with a correlation coefficient of 1, are the correlations of a variable with itself. For example, the correlation between "MN of colon - M" and "MN of colon - M" is 1 because they are the same variable.
In short, the heatmap summarizes how the measures for the selected cancer types move together: red cells mark pairs that rise together, blue cells mark pairs that move in opposite directions, and near-gray cells mark pairs with little linear relationship.
It is important to note that correlation does not imply causation. A correlation between two variables indicates a relationship, but it does not necessarily mean that one variable causes the other.
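To make the coefficient concrete, here is a minimal sketch of the Pearson formula in plain Python, applied to made-up rates for five countries. The numbers and the helper `pearson_r` are illustrative assumptions, not values or code from the notebook.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical rates for five countries (illustrative only, not from the dataset)
colon = [10.0, 20.0, 30.0, 40.0, 50.0]
lung_up = [12.0, 22.0, 32.0, 42.0, 52.0]    # rises in step with colon
lung_down = [50.0, 40.0, 30.0, 20.0, 10.0]  # falls as colon rises

r_up = pearson_r(colon, lung_up)
r_down = pearson_r(colon, lung_down)
print(round(r_up, 6), round(r_down, 6))
```

The first pair sits at the red end of the color scale and the second at the blue end; real data usually lands somewhere in between.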
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
# Suppress warnings
warnings.filterwarnings("ignore")
# Read the dataset
df = oecd_cancer_df
# Group by Variable and Country and calculate the mean value for each pair
mean_incidence = df.groupby(['Variable', 'Country'])['Value'].mean().reset_index()
# Pivot the data to have a column for each cancer type
mean_incidence_pivot = mean_incidence.pivot_table(index='Country', columns='Variable', values='Value').reset_index()
# Remove the multi-level column index
mean_incidence_pivot.columns.name = None
mean_incidence_pivot = mean_incidence_pivot.reset_index(drop=True)
# Plot a pairplot to show the relationships between different cancer types
sns.set(style="ticks", font_scale=1.2)
# fill=True replaces the deprecated shade=True option of the KDE plots
pairplot = sns.pairplot(mean_incidence_pivot, diag_kind="kde", markers="+",
                        height=2.5, aspect=1.2,
                        plot_kws=dict(s=50, edgecolor="r", facecolor="r", linewidth=1),
                        diag_kws=dict(fill=True))
pairplot.fig.suptitle("Correlation Between Different Types of Cancer (Mean Incidence Rates)", y=1.03, fontsize=19)
plt.show()
The output plot is a pair plot, which is a matrix of scatter plots that helps visualize the relationships between different variables. In this case, the variables are the mean incidence rates of different types of cancer in various countries. The primary goal of this plot is to explore the correlation between the different types of cancer and identify any trends or patterns in the data.
Each scatter plot within the matrix represents a pair of cancer types, with one type plotted on the x-axis and the other on the y-axis. Each point on a scatter plot represents a country, with its mean incidence rate for the two types of cancer.
The diagonal plots are kernel density estimates (KDE), which provide a smoothed, continuous visualization of the distribution of the mean incidence rates for each cancer type across the countries. These plots can help identify the general shape of the distribution, such as whether it is unimodal, bimodal, or skewed.
If there's a strong positive correlation between two types of cancer, the points in the scatter plot will form an upward-sloping pattern, indicating that as the incidence rate of one cancer type increases, the other cancer type's incidence rate tends to increase as well. Conversely, a strong negative correlation will have the points forming a downward-sloping pattern, meaning that as the incidence rate of one cancer type increases, the other cancer type's incidence rate tends to decrease. If there's little to no correlation, the points will be scattered with no clear pattern.
In addition to identifying correlations, the pair plot can also reveal any potential outliers or unusual data points. Outliers may indicate issues with the data or unique situations in specific countries that warrant further investigation.
When reading the pair plot, look for which pairs of cancer types show strong, weak, or no visible correlation. Examining the KDE plots and any noticeable outliers alongside the scatter plots gives a fuller picture of the data and of the relationships between different cancer types.
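For readers unfamiliar with kernel density estimates, the sketch below builds one by hand: each observation contributes a small Gaussian bump, and the curve is the average of those bumps. The rates, the bandwidth of 3.0, and the helper `gaussian_kde` are assumptions chosen for illustration; seaborn selects a bandwidth automatically rather than using a fixed value like this.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` with Gaussian kernels."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Average of Gaussian bumps centered on each sample
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical mean rates for six countries (illustrative only)
rates = [20.0, 22.0, 25.0, 40.0, 42.0, 45.0]
density = gaussian_kde(rates, bandwidth=3.0)  # bandwidth picked by hand

# Riemann sum over 0..80 to check that the curve integrates to roughly 1
step = 0.1
area = sum(density(i * step) * step for i in range(0, 801))

print(round(area, 3))
print(density(22.0) > density(32.0), density(42.0) > density(32.0))
```

With two groups of countries around 22 and 42, the estimated curve has two peaks and a dip between them, which is the bimodal shape mentioned above.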